Supplementary Figure 5 to 7

Author

Joanny Raby

Supplementary Figures 5 and 7 correspond to Section 2. Supplementary Figure 6 is first mentioned in the second results section of the paper, but its content primarily relates to Section 3.

These figures are separate for page size reason, to keep a reasonable loading time.

The formatting of the figures may differ slightly from those in the paper, but they display the same data points.

All code cells are folded by default. To view any cell, click “Code” to expand it, or use the code options near the main title above to unfold all at once.

Some code may be repeated, as the original Python notebook was designed for figures to be generated semi-independently.

Setup Code - Imports and co.

Setup imports.

Code
from __future__ import annotations

import itertools
import re
import tarfile
from pathlib import Path
from typing import Dict, List, Sequence

import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from IPython.display import display
from plotly.subplots import make_subplots
from scipy.stats import zscore

from epiclass.utils.notebooks.paper.paper_utilities import (
    ASSAY,
    ASSAY_MERGE_DICT,
    ASSAY_ORDER,
    BIOMATERIAL_TYPE,
    CELL_TYPE,
    LIFE_STAGE,
    SEX,
    IHECColorMap,
    MetadataHandler,
    SplitResultsHandler,
    create_mislabel_corrector,
    PathChecker
)

CORE7_ASSAYS = ASSAY_ORDER[0:7]

Setup paths.

Code
# Root path
base_dir = Path.home() / "Projects/epiclass/output/paper"
PathChecker.check_directory(base_dir)

# More precise
base_data_dir = base_dir / "data"
base_fig_dir = base_dir / "figures"
table_dir = base_dir / "tables"
metadata_dir = base_data_dir / "metadata"

official_metadata_dir = metadata_dir / "epiatlas" / "official"
PathChecker.check_directory(official_metadata_dir)

# alias
paper_dir = base_dir

Setup colors.

Code
IHECColorMap = IHECColorMap(base_fig_dir)
assay_colors = IHECColorMap.assay_color_map
cell_type_colors = IHECColorMap.cell_type_color_map
sex_colors = IHECColorMap.sex_color_map

Setup metadata and prediction files handlers.

Code
split_results_handler = SplitResultsHandler()

metadata_handler = MetadataHandler(paper_dir)
metadata_v2 = metadata_handler.load_metadata("v2")
metadata_v2_df = metadata_v2.to_df()

Setup figures general settings.

Code
main_title_settings = {
    "title":dict(
        automargin=True,
        x=0.5,
        xanchor="center",
        yanchor="top",
        y=0.98
        ),
    "margin":dict(t=50, l=10, r=10)
}

Supplementary Figure 5

A - MLP AUROC on various classification tasks

For AUROC metrics, see Figure 2A,B

B - chrY signal per sex: Average EpiRR z-score per assay

Load 10-fold cross-validation results.

Code
sex_10fold_dir = (
    base_data_dir
    / "training_results"
    / "dfreeze_v2"
    / "hg38_100kb_all_none"
    / f"{SEX}_1l_3000n"
    / "10fold-oversampling"
)
PathChecker.check_directory(sex_10fold_dir)

split_results: Dict[str, pd.DataFrame] = split_results_handler.read_split_results(
    sex_10fold_dir
)
concat_results_10fold: pd.DataFrame = split_results_handler.concatenate_split_results(split_results, depth=1)  # type: ignore
concat_results_10fold = split_results_handler.add_max_pred(concat_results_10fold)
concat_results_10fold = metadata_handler.join_metadata(concat_results_10fold, metadata_v2)

Load chrY signal values.

Code
chrY_dir = base_data_dir / "chrY" / "dfreeze_v2_stats"
PathChecker.check_directory(chrY_dir)

chrY_df = pd.read_csv(chrY_dir / "chrY_zscores.csv")

Merge previous dataframes.

Code
cross_val_analysis = pd.merge(
    concat_results_10fold,
    chrY_df,
    left_index=True,
    right_on="filename",
    suffixes=("", "_DROP")
    )
cross_val_analysis.drop(
    columns=[c for c in cross_val_analysis.columns if c.endswith("_DROP")], inplace=True
)

Define function zscore_per_assay to compute and graph the metric for each assay instead of globally.

Code
def zscore_per_assay(
    zscore_df: pd.DataFrame, logdir: Path | None = None, name: str | None = None
) -> None:
    """
    Plot the z-score distributions per assay.

    Does not include pval and raw tracks.

    Args:
        zscore_df: The dataframe with z-score data.
    """
    zscore_df = zscore_df.copy(deep=True)

    # Remove pval/raw tracks + rna unstranded
    zscore_df = zscore_df[~zscore_df["track_type"].isin(["pval", "raw", "Unique_raw"])]

    # Merge rna protocols
    zscore_df.replace({ASSAY: ASSAY_MERGE_DICT}, inplace=True)

    # Remove NAs
    metric_label = "chrY_zscore_vs_assay_w_track_exclusion"
    zscore_df = zscore_df[zscore_df[metric_label] != "NA"]

    assay_sizes = zscore_df[ASSAY].value_counts()
    assays = sorted(assay_sizes.index)

    x_title = "Sex z-score distributions per assay"
    fig = make_subplots(
        rows=1,
        cols=len(assays),
        shared_yaxes=True,
        x_title=x_title,
        y_title="z-score",
        horizontal_spacing=0.02,
        subplot_titles=[
            f"{assay_label} ({assay_sizes[assay_label]})" for assay_label in assays
        ],
    )

    for i, assay_label in enumerate(sorted(assays)):
        sub_df = zscore_df[zscore_df[ASSAY] == assay_label]

        hovertext = [
            f"{epirr}: z-score={z_score:.3f}, pred={pred:.3f}"
            for epirr, pred, z_score in zip(
                sub_df["EpiRR"],
                sub_df["Max pred"],
                sub_df[metric_label],
            )
        ]
        hovertext = np.array(hovertext)

        sub_df.reset_index(drop=True, inplace=True)
        y_values = sub_df[metric_label].values

        female_idx = np.argwhere((sub_df[SEX] == "female").values).flatten()
        male_idx = np.argwhere((sub_df[SEX] == "male").values).flatten()

        fig.add_trace(
            go.Box(
                name=assay_label,
                y=y_values[female_idx],
                boxmean=True,
                boxpoints="all",
                hovertemplate="%{text}",
                text=hovertext[female_idx],
                marker=dict(
                    size=2,
                    color=sex_colors["female"],
                    line=dict(width=0.5, color="black"),
                ),
                fillcolor=sex_colors["female"],
                line=dict(width=1, color="black"),
                showlegend=False,
                legendgroup="Female",
            ),
            row=1,
            col=i + 1,
        )

        fig.add_trace(
            go.Box(
                name=assay_label,
                y=y_values[male_idx],
                boxmean=True,
                boxpoints="all",
                hovertemplate="%{text}",
                text=hovertext[male_idx],
                marker=dict(
                    size=2, color=sex_colors["male"], line=dict(width=0.5, color="black")
                ),
                fillcolor=sex_colors["male"],
                line=dict(width=1, color="black"),
                showlegend=False,
                legendgroup="Male",
            ),
            row=1,
            col=i + 1,
        )

    # Add a dummy scatter plot for legend
    fig.add_trace(
        go.Scatter(
            x=[None],
            y=[None],
            mode="markers",
            name="Female",
            marker=dict(color=sex_colors["female"], size=20),
            showlegend=True,
            legendgroup="Female",
        )
    )
    fig.add_trace(
        go.Scatter(
            x=[None],
            y=[None],
            mode="markers",
            name="Male",
            marker=dict(color=sex_colors["male"], size=20),
            showlegend=True,
            legendgroup="Male",
        )
    )

    fig.update_xaxes(showticklabels=False)
    fig.update_yaxes(range=[-1.5, 3], showticklabels=True)

    fig.update_layout(
        width=1500,
        height=750,
    )

    # Save figure
    if logdir:
        if name is None:
            name = "zscore_distributions_per_assay"
        fig.write_image(logdir / f"{name}.svg")
        fig.write_image(logdir / f"{name}.png")
        fig.write_html(logdir / f"{name}.html")

    fig.show()

Graph zscore per assay.

Code
zscore_per_assay(cross_val_analysis)

Supp. Fig. 5B: Distribution of average z-score signal of epigenomes (dots) over chrY per sex (female in red, male in blue) for each assay individually (showing only the fold change track type for the ChIP datasets, and the two types of WGBS and RNA-seq were merged). Dashed lines represent means, solid lines the medians, boxes the quartiles, and whiskers the farthest points within 1.5× the interquartile range.

To see the RNA-Seq and WGBS results, scroll to the right using the horizontal scrollbar at the bottom of the graph.

C - Female/Male chrY signal z-score cluster separation

Define function merged_assays_separation_distance that computes and graphs the showing separation distance between male/female zscore clusters.

Code
def merged_assays_separation_distance(
    zscore_df: pd.DataFrame, logdir: Path | None = None, name: str | None = None
) -> None:
    """Complement to figure 2E, showing separation distance (mean, median)
    between male/female zscore clusters, for ChIP-seq (core7). Grouped by EpiRR.

    Args:
        zscore_df (pd.DataFrame): The dataframe with z-score data.
        logdir (Path): The directory path to save the output plots.
        name (str): The base name for the output plot files.
    """
    metric_label = "chrY_zscore_vs_assay_track"

    # Preprocessing
    zscore_df = zscore_df.copy(deep=True)
    zscore_df.replace({ASSAY: ASSAY_MERGE_DICT}, inplace=True)

    zscore_df = zscore_df[zscore_df[ASSAY].isin(CORE7_ASSAYS)]  # type: ignore

    # Remove pval/raw tracks
    zscore_df = zscore_df[~zscore_df["track_type"].isin(["pval", "raw"])]

    # Average chrY z-score values
    mean_chrY_values_df = zscore_df.groupby(["EpiRR", SEX]).agg(
        {metric_label: "mean", "Max pred": "mean"}
    )
    mean_chrY_values_df.reset_index(inplace=True)
    if not mean_chrY_values_df["EpiRR"].is_unique:
        raise ValueError("EpiRR is not unique.")

    mean_chrY_values_df.reset_index(drop=True, inplace=True)

    distances = {"mean": [], "median": []}
    min_preds = list(np.arange(0, 1.0, 0.01)) + [0.999]
    sample_count = []
    for min_pred in min_preds:
        subset_chrY_values_df = mean_chrY_values_df[
            mean_chrY_values_df["Max pred"] > min_pred
        ]
        sample_count.append(subset_chrY_values_df.shape[0])

        # Compute separation distances
        chrY_vals_female = subset_chrY_values_df[subset_chrY_values_df[SEX] == "female"][
            metric_label
        ]
        chrY_vals_male = subset_chrY_values_df[subset_chrY_values_df[SEX] == "male"][
            metric_label
        ]

        if not chrY_vals_female.empty and not chrY_vals_male.empty:
            mean_distance = np.abs(chrY_vals_female.mean() - chrY_vals_male.mean())
            median_distance = np.abs(chrY_vals_female.median() - chrY_vals_male.median())

            distances["mean"].append(mean_distance)
            distances["median"].append(median_distance)
        else:
            distances["mean"].append(np.nan)
            distances["median"].append(np.nan)

    # Plotting the results
    fig = go.Figure()

    # Add traces for mean and median distances
    fig.add_trace(
        go.Scatter(
            x=min_preds,
            y=distances["mean"],
            mode="lines+markers",
            name="Mean Distance (left)",
            line=dict(color="blue"),
        )
    )
    fig.add_trace(
        go.Scatter(
            x=min_preds,
            y=distances["median"],
            mode="lines+markers",
            name="Median Distance (left)",
            line=dict(color="green"),
        )
    )

    # Add trace for number of files
    fig.add_trace(
        go.Scatter(
            x=min_preds,
            y=np.array(sample_count) / max(sample_count),
            mode="lines+markers",
            name="Proportion of samples (right)",
            line=dict(color="red"),
            yaxis="y2",
        )
    )

    fig.update_xaxes(range=[0.499, 1.0])

    # Update layout for secondary y-axis
    fig.update_layout(
        title="Separation Distance of chrY z-scores male/female clusters - ChIP-Seq",
        xaxis_title="Average Prediction Score minimum threshold",
        yaxis_title="Z-score Distance",
        yaxis2=dict(title="Proportion of samples", overlaying="y", side="right"),
        yaxis2_range=[0, 1.001],
        legend=dict(
            x=1.08,
        ),
    )

    # Save figure
    if logdir:
        if name is None:
            name = "zscore_cluster_separation_distance"
        fig.write_image(logdir / f"{name}.svg")
        fig.write_image(logdir / f"{name}.png")
        fig.write_html(logdir / f"{name}.html")

    fig.show()

Graph.

Code
merged_assays_separation_distance(cross_val_analysis)

Supp. Fig. 5C: Effect of a prediction score threshold on the aggregated mean (blue) and median (green) sex z-score male/female cluster distances, as well as and corresponding file subset size (red) of ChIP-related assays from panel B.

D - GP-Age, all samples

See Figure 2F section.

E - Epilogos sex genes chromatin states

Images extracted from Epilogos viewer, using specified coordinates (XIST and FIRRE positions), and:

  • View mode: Paired
  • Dataset: IHEC
  • Pairwise: Male VS Female 100 samples
  • Saliency Metric: S1

Supp. Fig. 5E: Epilogos pairwise comparisons of male (top) vs female (bottom) showing portions of important regions for the Sex classifier, including the XIST (left) and FIRRE (right) genes.

See Annex A for a more detailled Epilogos color legend.

F - Genome browser for biospecimen important regions

Supp. Fig. 5F: Genome browser representation of the important regions shown in Figure 2I.

Supplementary Figure 6 - Prediction score threshold impact

For full code, since the processing is more complex, see src/python/epiclass/utils/notebooks/paper/confidence_threshold.ipynb (permalink).

Supplementary Figure 6: Impact of prediction score threshold on performance metrics. Impact of prediction score threshold on accuracy, F1-score and number of files for both EpiATLAS cross-validation performance and inference on datasets from other databases with provided or extracted labels. Performance for Assay, Sex, Cancer, Biomaterial type and Life stage classifiers are shown, for EpiATLAS, ENCODE core/non-core, ChIP-Atlas and Recount3 datasets. The number of classes (C) and the number of files analyzed (N) used to calculate the performances are shown at the bottom for each graph. The 11 classes of the Assay classifiers for EpiATLAS correspond to the six ChIP-Seq histone modifications, their control Input file, and two protocols of both RNA-Seq and WGBS, while for the 9 classes of ENCODE the two protocols were grouped, and indeed only the seven ChIP-related assays were used for ChIP-Atlas and RNA-Seq for Recount3. For the Sex classifier the third class corresponds to ‘mixed’, absent for ENCODE. The Cancer classifier is binary (where non-cancer is a mix of healthy and other diseases). For the Biomaterial classifier the ‘primary cell culture’ class is missing from all public sources (but ‘primary cell’, ‘primary tissue’ and ‘cell line’ are present), while the three classes (perinatal, pediatric, adult) were always used for the Life stage classifier.

Supplementary Figure 7 - Biospecimen classifier - ChromScore for high-SHAP regions

For full code, since the processing is more complex, see src/python/epiclass/utils/notebooks/analyze_hdf5_vals.ipynb (permalink), particularly section “ChromScore hdf5 values”.

This is the breakdown per cell type of Figure 2G.

Supplementary Figure 7: Class-Specific ChromScore Distributions in Biospecimen Classifier Regions Identified by SHAP. Distribution of the average ChromScore values over the important Biospecimen classifier regions according to SHAP (blue) compared to the global distribution (pink), for each biospecimen class sorted according to the EpiATLAS order. The number of files and regions analyzed to calculate the p-values are shown. ChromScore values for each region are averaged over files of each class (one average value per feature). Statistical significance was assessed using a two-sided Welch’s t-test where the number of regions was above 30, otherwise the reported p-value is the worst of Welch’s t-test and non-parametric Brunner-Munzel. Dashed lines represent means, solid lines the medians, boxes the quartiles.

Annex A - Epilogos color legend

  • biv = bivalent
  • Tx = Transcription
  • Repr = Repressed
  • PC = Polycomb